The video game market has continued to grow over the years and now spans all genres, expanding into many other walks of life, from movies and TV shows to books and fan fiction. With this growing market, publishers want to capitalize on the interests of the gamers within it, and picking the right kind of game to develop and bring to market can make or break a publisher or studio.
The growth of the market also comes with rising expectations for the kinds of games players want, so making games has become steadily more expensive over time. This puts further strain on publishers: if they develop the wrong game and it flops in the market, it becomes a huge waste of time, money, and resources. A model built on past successes and failures in the gaming marketplace can therefore help them decide which type of game to make and where to market it for the best possible sales results.
I will evaluate the performance of multiple models to determine which one performs best on this dataset. The models I will be using are Linear Regression, Random Forest, K-Nearest Neighbors (regression), and Decision Tree. I think it is important to use a variety of model types so as not to limit the tools available for the analysis. All models will use the scikit-learn implementations.
To ensure that the data does not contain points that would skew the models, I will remove all records with values more than three standard deviations from the column mean before I start modeling.
I will determine how the different models perform by splitting the dataset into training and testing sets and using the training set for the initial model analysis. Each model will be evaluated with the adjusted R squared value, where a score closer to 1 indicates a better fit. I will use k-fold cross validation to check whether my models are under- or overfitting the data and make changes accordingly.
After determining which model performs best, I will use it to predict video game sales on the "unseen" test data. Prediction performance will also be evaluated with adjusted R squared, where a score closer to 1 indicates better predictions.
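For reference, the adjusted R squared used as the evaluation metric throughout this analysis (and implemented later as a custom GridSearchCV scorer) is the standard adjustment of R squared for the number of observations n and predictors p:

$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$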
Download Location: https://www.kaggle.com/gregorut/videogamesales
import warnings
import numpy as np
import pandas as pd
import seaborn as sb
import pandas_profiling as pp
from notebook import __version__ as nbv
# scipy Libraries
from scipy import stats
from scipy.stats import norm
from scipy import __version__ as scipv
# matplotlib Libraries
import matplotlib.pyplot as plt
from matplotlib import __version__ as mpv
# plotly Libraries
import plotly.express as ex
from plotly import __version__ as pvm
# sklearn Libraries
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn import __version__ as skv
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
# Library Versions
lib_info = [('scipy', scipv), ('pandas', pd.__version__), ('numpy', np.__version__),
('plotly', pvm), ('sklearn', skv), ('seaborn', sb.__version__), ('matplotlib', mpv),
('pandas_profiling', pp.__version__), ('Jupyter Notebook (notebook)', nbv)]
print('Library Versions\n' + '='*16)
for name, vers in lib_info:
    print('{:>27} = {}'.format(name, vers))
vgsData = pd.read_csv('Video_Game_Data/Video_Game_Sales.csv')
print("Dataset Dimensions: {:,} columns and {:,} rows".format(vgsData.shape[1], vgsData.shape[0]))
vgsData.head()
print("Describe Data:")
vgsData.describe()
fig = plt.figure()
fig.subplots_adjust(hspace=0.8, wspace=0.5)
fig.set_size_inches(13.5, 13)
sb.set(font_scale = 1.25)
warnings.filterwarnings('ignore')
i = 1
for var in vgsData.columns:
    try:
        fig.add_subplot(4, 2, i)
        sb.distplot(pd.Series(vgsData[var], name=''), bins=100,
                    fit=norm, kde=False).set_title(var + " Histogram")
        plt.ylabel('Count')
        i += 1
    except ValueError:
        pass
fig.tight_layout()
warnings.filterwarnings('default')
# Video Game Sales Profiling Report
pp.ProfileReport(vgsData).to_notebook_iframe()
ex.pie(vgsData, names='Genre', title='Proportion of Global Video Game Sales by Genre')
We can see that the "Action" genre is the most popular type of video game by far.
plt.rcParams['figure.figsize'] = (16, 10)
sb.set(font_scale = 1.5)
sales_list = ['Global_Sales', 'NA_Sales', 'EU_Sales', 'JP_Sales']
vid_plat_reg = vgsData.groupby('Platform')[sales_list].sum().sort_values(by='Global_Sales', ascending=False).reset_index()
color_labels = zip(sales_list, ['#016FB9', '#D1101D', '#71B340', '#FF9505'])
for i, j in color_labels:
    sb.histplot(
        vid_plat_reg[:25],
        x='Platform',
        weights=i,
        multiple='stack',
        edgecolor='white',
        shrink=0.8,
        bins=5,
        label=i.split('_')[0],
        color=j,
        alpha=1
    )
plt.xticks(rotation=45)
plt.legend(loc='upper right')
plt.title('Top 25 Video Game Sales by Platform and Region')
plt.ylabel('Sales (In Millions)')
sales_list = ['Global_Sales', 'NA_Sales', 'EU_Sales', 'JP_Sales']
top_10_pubs = vgsData.groupby(['Publisher'
])[sales_list].sum().sort_values(by='Global_Sales',
ascending=False).reset_index()['Publisher'].unique()[:10]
vid_year_pub = vgsData.groupby(['Year', 'Publisher'])[sales_list].sum().sort_values(by='Global_Sales',
ascending=False).reset_index()
vid_year_pub = vid_year_pub[vid_year_pub['Publisher'].isin(top_10_pubs)]
ex.area(vid_year_pub, x='Year', y="Global_Sales", color="Publisher",
title='Top 10 Video Game Publishers by Global Sales Per Year',
labels={"Global_Sales": "Sales (In Millions)", "Publisher": 'Publishers'})
plt.rcParams['figure.figsize'] = (16, 10)
sb.set(font_scale = 1.5)
sb.set_style(style='white')
sb.heatmap(vgsData.corr(), annot=True).set_title('Annotated Correlation Matrix of Sales Columns')
The matrix above makes it clear that both the Rank and Year columns have no significant relationship with any of the sales columns and thus they can be safely dropped from the dataset.
OneHotEncoded Genre column
df = pd.get_dummies(vgsData, prefix=['Genre'], columns=['Genre'], drop_first=True)
plt.rcParams['figure.figsize'] = (16, 10)
# sb.set(font_scale = 1.0)
sb.set_style(style='white')
sb.heatmap(df.corr(), annot=True).set_title('Annotated Correlation Matrix of Encoded Genre Columns')
From the matrix above we can see that the genre columns have no significant relationship with the sales columns, the only exception being JP_Sales and Genre_Role-Playing (0.16).
While the encoded genre columns (aside from the Japan exception noted above) do not show much significance towards sales, they will not be dropped from the dataset and will be used during modeling. The main reason is that we cannot have only sales columns predicting other sales, and given the small number of genres, I do not expect them to add much modeling time.
OneHotEncoded Top 20 (by Sales) Publisher Columns
sales_list = ['Global_Sales', 'NA_Sales', 'EU_Sales', 'JP_Sales']
top_20_pubs = vgsData.groupby(['Publisher'
])[sales_list].sum().sort_values(by='Global_Sales',
ascending=False).reset_index()['Publisher'].unique()[:20]
df = pd.get_dummies(vgsData[vgsData['Publisher'].isin(top_20_pubs)],
prefix=['Publ'], columns=['Publisher'], drop_first=True)
plt.rcParams['figure.figsize'] = (13, 10)
plt.rcParams.update({'font.size': 7.2})
# sb.set(font_scale = 1.0)
sb.set_style(style='white')
sb.heatmap(df.corr(), annot=True).set_title('Annotated Correlation Matrix of Top 20 (by Sales) Encoded Publisher Columns')
While a bit hard to read, the matrix above shows that nearly all of the top 20 publishers have no significant relationship with the sales columns, the only notable exception being Publ_Nintendo across all of them: NA_Sales (0.21), EU_Sales (0.17), JP_Sales (0.41), and Global_Sales (0.25).
Since the top 20 encoded publisher columns (aside from the Nintendo exception noted above) do not show much significance towards sales, the Publisher column will be dropped from the dataset and not used during modeling.
OneHotEncoded Platform Column
df = pd.get_dummies(vgsData, prefix=['Plat'], columns=['Platform'], drop_first=True)
plt.rcParams['figure.figsize'] = (16, 10)
plt.rcParams.update({'font.size': 5})
# sb.set(font_scale = 1.0)
sb.set_style(style='white')
sb.heatmap(df.corr(), annot=True).set_title('Annotated Correlation Matrix of Encoded Platform Columns')
While hard to read, the matrix above shows that the platform columns have no significant relationship with the sales columns, the only exceptions being JP_Sales with Platform_GB (0.2), Platform_NES (0.23), and Platform_SNES (0.18).
Given that the encoded platform columns (aside from the few Japan exceptions noted above) show no significance towards sales, the Platform column will be dropped from the dataset. There are also simply too many platform columns to include, and they would greatly increase the time it takes to train the various models.
vgsData.drop(['Rank', 'Name', 'Year', 'Publisher', 'Platform'], axis='columns', inplace=True)
vgsData.head()
print('Check for missing values in the dataset: {}'.format(vgsData.isnull().values.any()))
if vgsData.isnull().values.any():
    old_size = len(vgsData)
    vgsData = vgsData.dropna()
    new_size = len(vgsData)
    print('\nNumber of Rows that were Removed: {:,}'.format(old_size - new_size))
before = vgsData.shape[0]
print('Dataset size before outlier removal: {:,}'.format(before))
# Removes all records in the dataset that has data that is more than
# three standard deviations away from the mean of each column
vgsData = vgsData[(np.abs(stats.zscore(vgsData[vgsData.columns[1:]])) < 3).all(axis=1)].reset_index(drop=True)
after = vgsData.shape[0]
print(' Dataset size after outlier removal: {:>6,}\n'.format(after) + '='*43 +
'\n\t Total records removed: {:>6,}'.format(before - after))
vgsData = pd.get_dummies(vgsData, prefix=['Genre'], columns=['Genre'], drop_first=True)
vgsData.head()
plt.rcParams['figure.figsize'] = (16, 10)
plt.rcParams.update({'font.size': 13})
# sb.set(font_scale = 1.5)
sb.set_style(style='white')
sb.heatmap(vgsData.corr(), annot=True).set_title('Final Annotated Correlation Matrix with OneHotEncoded Genre Columns')
seed = 74 # Seed for train/test split and Model reproduction
x_train, x_test, y_train, y_test = train_test_split(vgsData[vgsData.columns.drop('Global_Sales')],
vgsData['Global_Sales'],
train_size=0.70,
random_state=seed)
print("X_train Dimensions: {:,} columns and {:,} rows".format(x_train.shape[1], x_train.shape[0]))
x_train.head()
print("Describe Data: X_train")
x_train.describe()
print("Y_train Dimensions: 1 column and {:,} rows".format(y_train.shape[0]))
y_train.head()
print("Describe Data: Y_train")
y_train.describe()
Based on the preprocessing and analysis above, I can see that the data has no missing or duplicated values that would need to be accounted for in the later analysis. There are some strong correlations between several of the variables in the dataset, and the outlying data points that were found have been removed to prevent them from skewing the analysis results.
I will be conducting regression modeling with all four of the models I outlined in the proposal: Linear Regression, Random Forest, K-Nearest Neighbor, and Decision Tree.
The adjusted R squared scores of all the models will be compared at the end to determine which model performed best, and that model will then be used to predict video game sales on the unseen test data.
For the hyperparameter selection of each model, the methodology I used for deciding which parameters to include in each GridSearchCV was to focus mainly on the different algorithms available for each model and on how the model itself grows (more applicable to KNN, Random Forest, and Decision Tree). I reviewed the sklearn documentation for each model's parameters to determine which ones fit these criteria, and I felt those parameters were the most important for effectively finding the best settings for each model.
While tuning every parameter available for each model would be ideal, I felt this would take too much time and would not be the best use of it because, in my opinion, not all of the available parameters are equally important (and thus, comparatively, not worth the time).
def adjusted_R_squared(estimator, x, y):
    # estimator.score(x, y) returns the R squared of the model
    return round(1 - (1 - estimator.score(x, y)) * ((x.shape[0] - 1) / (x.shape[0] - x.shape[1] - 1)), 5)
lr_pipe = Pipeline(steps=([
('scale', StandardScaler()),
('lr', LinearRegression())
]))
param_grid = {'lr__fit_intercept': [True, False],
'lr__normalize': [True, False]}
lr_grid = GridSearchCV(lr_pipe, refit=True, scoring=adjusted_R_squared,
param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
lr_grid.fit(x_train, y_train)
lr_df = pd.DataFrame(lr_grid.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(10)
lr_df
print('Best Linear Regression Parameters\n=================================')
for name, val in lr_df.iloc[0]['params'].items():
    print('{:>19}: {}'.format(name.replace('lr__', ''), val))
lr_adjR2 = lr_df.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(lr_adjR2, 4)))
rf_pipe = Pipeline(steps=([
('scale', StandardScaler()),
('rf', RandomForestRegressor(random_state=seed))
]))
param_grid = {'rf__max_depth': np.arange(2, 10, 2),
'rf__max_features': ['auto', 'sqrt', 'log2'],
'rf__min_samples_leaf': [1, 2, 4],
'rf__min_samples_split': [2, 5, 8],
'rf__n_estimators': np.append(100, np.arange(200, 800, 200))}
rf_grid = GridSearchCV(rf_pipe, refit=True, scoring=adjusted_R_squared,
param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)
rf_grid.fit(x_train, y_train)
rf_df = pd.DataFrame(rf_grid.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(10)
rf_df
print('Best Random Forest Regression Parameters\n========================================')
for name, val in rf_df.iloc[0]['params'].items():
    print('{:>24}: {}'.format(name.replace('rf__', ''), val))
rf_adjR2 = rf_df.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(rf_adjR2, 4)))
knn_pipe = Pipeline(steps=([
('scale', StandardScaler()),
('knn', KNeighborsRegressor())
]))
param_grid = {'knn__n_neighbors': np.arange(1, 26, 2),
'knn__weights': ['uniform'],
'knn__algorithm': ['ball_tree', 'kd_tree', 'brute'],
'knn__leaf_size': np.arange(1, 26, 2),
'knn__p': [1, 2]}
knn_grid = GridSearchCV(knn_pipe, refit=True, scoring=adjusted_R_squared,
param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)
knn_grid.fit(x_train, y_train)
knn_df = pd.DataFrame(knn_grid.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(100)
knn_df
print('Best Knn Regression Parameters\n==============================')
for name, val in knn_df.iloc[0]['params'].items():
    print('{:>15}: {}'.format(name.replace('knn__', ''), val))
knn_adjR2 = knn_df.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(knn_adjR2, 4)))
dt_pipe = Pipeline(steps=([
('scale', StandardScaler()),
('dt', DecisionTreeRegressor(random_state=seed))
]))
param_grid = {'dt__criterion': ['mse', 'friedman_mse', 'mae'],
'dt__splitter': ['best', 'random'],
'dt__max_features': ['auto', 'sqrt', 'log2'],
'dt__max_depth': np.arange(1, 12, 2),
'dt__min_samples_leaf': [1, 2, 4],
'dt__min_samples_split': [2, 5, 7],
'dt__ccp_alpha': [0.0, 1.0]}
dt_grid = GridSearchCV(dt_pipe, refit=True, scoring=adjusted_R_squared,
param_grid = param_grid, cv = 5, n_jobs = -1, verbose=2)
dt_grid.fit(x_train, y_train)
dt_df = pd.DataFrame(dt_grid.cv_results_).sort_values('mean_test_score',
ascending=False)[['params', 'mean_test_score']].head(10)
dt_df
print('Best Decision Tree Regression Parameters\n========================================')
for name, val in dt_df.iloc[0]['params'].items():
    print('{:>23}: {}'.format(name.replace('dt__', ''), val))
dt_adjR2 = dt_df.iloc[0]['mean_test_score']
print('\nAdjusted R Squared: {}'.format(round(dt_adjR2, 4)))
adj_R_squares = [lr_adjR2, rf_adjR2, knn_adjR2, dt_adjR2]
modelTypes = ['Linear Regression', 'Random Forest', 'K-Nearest Neighbor', 'Decision Tree']
model_r_df = pd.DataFrame(zip(modelTypes, adj_R_squares),
                          columns=['Model Type', 'Adj R Squared'])
model_r_df = model_r_df.nlargest(len(model_r_df), 'Adj R Squared')
model_r_df
After performing model analysis on all four model types and looking at the dataframe of adjusted R squared scores above, the model with the best performance on this data is Linear Regression (with Random Forest coming in a very close second).
print('Best Linear Regression Parameters\n' + '='*33)
params = {}
for name, val in lr_df.iloc[0]['params'].items():
    name = name.replace('lr__', '')
    params.update({name: val})
    print('{:>19}: {}'.format(name, val))
print('\nAdjusted R Squared: {}'.format(round(lr_df.iloc[0]['mean_test_score'], 4)))
best_pipe = Pipeline(steps=([
('scale', StandardScaler()),
('lr', LinearRegression(**params))
]))
best_model = best_pipe.fit(x_train, y_train)
best_model
y_pred = best_model.predict(x_test)
best_model_score = round(1 - (1 - r2_score(y_test, y_pred)) * ((y_test.shape[0] - 1) / (y_test.shape[0] - x_test.shape[1] - 1)), 4)
print("Best Linear Regression model score using test data\n" + '='*50 +
"\nAdjusted R Squared: {}".format(round(best_model_score, 4)))
print('\nDifference between experiment and best model adjusted R squared scores: {}'
.format(round(best_model_score - lr_adjR2, 4)))
Since the adjusted R squared value on the test data matches what I got during my experiments, I am confident the model I have selected will perform well with future, unseen data.
Based on all my analysis and experimentation, I am confident that the final model I have created is the best performing model for making predictions on future video game sales. The model could likely be improved by also using the other columns that needed to be encoded (Platform and Publisher), but due to the limitations of my current hardware, modeling with them would simply take longer than is feasible.
I would also recommend trying some kind of feature reduction method (PCA, SelectKBest, etc.), which could help reduce the number of features and the modeling time at the cost of a small loss in scoring; a sketch of this idea is shown below.
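As an illustration of that recommendation (not part of the analysis above), the following is a minimal sketch of how univariate feature selection could be slotted into the existing pipeline, assuming SelectKBest with f_regression and an illustrative k=5; the right k would need to be tuned, for example with the same GridSearchCV approach used earlier.
# Hypothetical sketch: univariate feature selection ahead of the regressor
# to cut down the number of encoded columns before fitting.
from sklearn.feature_selection import SelectKBest, f_regression

fs_pipe = Pipeline(steps=([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_regression, k=5)),  # k is illustrative; tune via GridSearchCV
    ('lr', LinearRegression())
]))
fs_pipe.fit(x_train, y_train)
print('Adjusted R Squared with feature selection: {}'.format(adjusted_R_squared(fs_pipe, x_test, y_test)))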